# You should set "fastBCR" folder as working directory before running pipeline.
# knitr::opts_knit$set(root.dir = "THE/PATH/TO/FASTBCR/FOLDER")
knitr::opts_knit$set(root.dir = "~/Documents/Rpackage/fastBCR-1.1") # Replace with your path.
library(fastBCR)
Loading required package: dplyr
Warning: package ‘dplyr’ was built under R version 4.2.3
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: ggplot2
Loading required package: ggmsa
Registered S3 methods overwritten by 'ggalt':
  method                  from   
  grid.draw.absoluteGrob  ggplot2
  grobHeight.absoluteGrob ggplot2
  grobWidth.absoluteGrob  ggplot2
  grobX.absoluteGrob      ggplot2
  grobY.absoluteGrob      ggplot2
ggmsa v1.4.0  Document: http://yulab-smu.top/ggmsa/

If you use ggmsa in published research, please cite:
L Zhou, T Feng, S Xu, F Gao, TT Lam, Q Wang, T Wu, H Huang, L Zhan, L Li, Y Guan, Z Dai*, G Yu* ggmsa: a visual exploration tool for multiple sequence alignment and associated data. Briefings in Bioinformatics. DOI:10.1093/bib/bbac222
Loading required package: msa
Loading required package: Biostrings
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:dplyr’:

    combine, intersect, setdiff, union

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get,
    grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:dplyr’:

    first, rename

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

Loading required package: IRanges

Attaching package: ‘IRanges’

The following objects are masked from ‘package:dplyr’:

    collapse, desc, slice

Loading required package: XVector
Loading required package: GenomeInfoDb

Attaching package: ‘Biostrings’

The following object is masked from ‘package:base’:

    strsplit

Loading required package: statnet
Loading required package: tergm
Loading required package: ergm
Warning: package ‘ergm’ was built under R version 4.2.3Loading required package: network
Warning: package ‘network’ was built under R version 4.2.3
‘network’ 1.18.2 (2023-12-04), part of the Statnet Project
* ‘news(package="network")’ for changes since last version
* ‘citation("network")’ for citation information
* ‘https://statnet.org’ for help, support, and other information


‘ergm’ 4.6.0 (2023-12-17), part of the Statnet Project
* ‘news(package="ergm")’ for changes since last version
* ‘citation("ergm")’ for citation information
* ‘https://statnet.org’ for help, support, and other information

‘ergm’ 4 is a major update that introduces some backwards-incompatible changes. Please type ‘news(package="ergm")’ for a list of major
changes.

Loading required package: networkDynamic
Warning: package ‘networkDynamic’ was built under R version 4.2.3
‘networkDynamic’ 0.11.4 (2023-12-10?), part of the Statnet Project
* ‘news(package="networkDynamic")’ for changes since last version
* ‘citation("networkDynamic")’ for citation information
* ‘https://statnet.org’ for help, support, and other information

Registered S3 method overwritten by 'tergm':
  method                   from
  simulate_formula.network ergm

‘tergm’ 4.2.0 (2023-05-30), part of the Statnet Project
* ‘news(package="tergm")’ for changes since last version
* ‘citation("tergm")’ for citation information
* ‘https://statnet.org’ for help, support, and other information


Attaching package: ‘tergm’

The following object is masked from ‘package:ergm’:

    snctrl

Loading required package: ergm.count

‘ergm.count’ 4.1.1 (2022-05-24), part of the Statnet Project
* ‘news(package="ergm.count")’ for changes since last version
* ‘citation("ergm.count")’ for citation information
* ‘https://statnet.org’ for help, support, and other information

Loading required package: sna
Warning: package ‘sna’ was built under R version 4.2.3Loading required package: statnet.common

Attaching package: ‘statnet.common’

The following object is masked from ‘package:ergm’:

    snctrl

The following object is masked from ‘package:Biostrings’:

    order

The following object is masked from ‘package:XVector’:

    order

The following object is masked from ‘package:IRanges’:

    order

The following object is masked from ‘package:S4Vectors’:

    order

The following object is masked from ‘package:BiocGenerics’:

    order

The following objects are masked from ‘package:base’:

    attr, order

sna: Tools for Social Network Analysis
Version 2.7-2 created on 2023-12-05.
copyright (c) 2005, Carter T. Butts, University of California-Irvine
 For citation information, type citation("sna").
 Type help(package="sna") to get started.

Loading required package: tsna

‘statnet’ 2019.6 (2019-06-13), part of the Statnet Project
* ‘news(package="statnet")’ for changes since last version
* ‘citation("statnet")’ for citation information
* ‘https://statnet.org’ for help, support, and other information
### Fast BCR clonal family inference
## 1. Data loading
# Path to the "COVID"/"HC" folder in the "example" folder in fastBCR folder.
COVID_folder_path <- "example/COVID"
HC_folder_path <- "example/HC"
# Load files from "COVID_folder_path"/"HC_folder_path" into list.
# The storage format of data can be in "csv", "tsv", or "Rdata" format.
# The compressed files in the above storage format in "7zip", "cab", "cpio", "iso9660", "lha", "mtree", "shar", "rar", "raw", "tar", "xar", "zip", "warc" format can also be read in.
COVID_raw_data_list <- data.load(folder_path = COVID_folder_path, storage_format = "csv")

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%
HC_raw_data_list <- data.load(folder_path = HC_folder_path, storage_format = "csv")

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%
## 2. Data preprocessing
# Preprocessing of raw data to meet the input requirements for clonal family inference.
# The input of the function needs to meet the AIRR standard format (containing at least "sequence_id", "v_call", "j_call", and "junction_aa" information).
# "junction" and "c_call" are optional if you want to plot the evolutionary tree or need isotype related (SHM/CSR) analysis.
# Only productive sequences whose junction amino acid lengths between 9 and 26 are reserved.
# The "count_col_name" parameter represents the column name for the count of each sequence which can be "consensus_count", "duplicate_count" or "umi_count" according to your needs.
# It defaults to "NA" which means the original count of the sequence is not taken into account.
# Sequences with the same "v_call", "j_call" and "junction_aa" are considered to be the same clonotype and are merged into one row in processed data.
# The column "clonotype_count" is the number of sequences belonging to each clonotype.
# The column "clone_count" is the sum of the counts (calculated based on "count_col_name" parameter) of the sequences belonging to each clonotype.
# The column "clone_fre" is the frequency version of "clone_count".
# The information of the sequence with the highest count in each clonotype is retained.
# The list "index_match" saves the original indexes of sequences for each clonotype.
COVID_pro_data_list <- data.preprocess(data_list = COVID_raw_data_list, count_col_name = "consensus_count")

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%
HC_pro_data_list <- data.preprocess(data_list = HC_raw_data_list, count_col_name = "consensus_count")

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%
## 3. BCR clonal family inference
# Fast clonal family inference from preprocessed data.
# The "cluster_thre" parameter represents minimal clustering criteria (minimum number of sequences needed to form a cluster) and defaults to 3.
# For high efficiency, the "cluster_thre" is increased by 1 for every 100,000 entries of input data.
# The "overlap_thre" parameter represents overlap coefficient threshold for merging two clusters, selectable range (0,1) and defaults to 0.1.
# Lower "overlap_thre" may lead to overclustering while higher thresholds may lead to the split of clonal families.
# The "consensus_thre" parameter represents the consensus score threshold for filtering candidates and defaults to 0.8.
# A higher "consensus_thre" means stricter inference of the cluster.
COVID_clusters_list <- data.BCR.clusters(pro_data_list = COVID_pro_data_list, cluster_thre = 3, overlap_thre = 0.1, consensus_thre = 0.8)

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%
HC_clusters_list <- data.BCR.clusters(pro_data_list = HC_pro_data_list, cluster_thre = 3, overlap_thre = 0.1, consensus_thre = 0.8)

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%
## 4. Classification of clustered and unclustered sequences
# Merge all the clustered sequences in each sample into "clustered_seqs".
# Merge all the unclustered sequences in each sample into "unclustered_seqs".
COVID_seqs_list <- Clustered.seqs(pro_data_list = COVID_pro_data_list, clusters_list = COVID_clusters_list)

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%
HC_seqs_list <- Clustered.seqs(pro_data_list = HC_pro_data_list, clusters_list = HC_clusters_list)

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%
### Downstream analysis
## 1. Single cluster analysis
## 1.1 Conserved motifs distribution
# Visualization of multiple sequence alignment (MSA) of junction sequences within a cluster.
# The parameter "index" allows you to choose a cluster for visualization.
# The parameter "type" can be "DNA" for deoxyribonucleic acid or "AA" for amino acid.
COVID_01_clusters = COVID_clusters_list$COVID_01
HC_01_clusters = HC_clusters_list$HC_01
msa.plot(bcr_clusters = COVID_01_clusters, index = 200, type = "DNA")
Coordinate system already present. Adding new coordinate system, which will replace the existing one.

msa.plot(bcr_clusters = COVID_01_clusters, index = 200, type = "AA")
Coordinate system already present. Adding new coordinate system, which will replace the existing one.

## 1.2 Sequence logo
# Visualization of sequence logo of junction sequences within a cluster.
seqlogo.plot(bcr_clusters = COVID_01_clusters, index = 200, type = "DNA")
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.Coordinate system already present. Adding new coordinate system, which will replace the existing one.

seqlogo.plot(bcr_clusters = COVID_01_clusters, index = 200, type = "AA")
Coordinate system already present. Adding new coordinate system, which will replace the existing one.

## 1.3 Evolutionary tree
# Reconstruct a B cell lineage tree with minimum spanning tree and genotype abundances using ClonalTree.
# The junction of BCR sequences within a cluster will be written as "ClonalFamily_index.fasta" in "ClonalTree/Examples/input" folder.
# ClonalTree uses Python for compilation, so it needs the absolute path of the Python interpreter in the "python_path" parameter.
# ClonalTree returns two files in the "ClonalTree/Examples/output" folder.
# ClonalFamily_index.nk: the reconstructed BCR lineage tree in newick format.
# ClonalFamily_index.nk.csv: a table in csv format, containing the parent relationship and cost.
clonal.tree.generation(bcr_clusters = COVID_01_clusters, index = 200, python_path = "/Users/wangkaixuan/anaconda3/bin/python") # Replace with your path
use default substitution matrix
Parameter setting = useAbundance:  False ; revision:  True ; trim: False
done
# You can use the clonal.tree.plot() function to visualize the evolutionary tree in R.
# The consensus sequence of the cluster is used as the root node of the tree.
# Orange points represent nodes and blue points represent tips.
# Abbreviation of "sequence_id" (last 5 characters) are shown.
# The x-axis shows the absolute genetic distance.
clonal.tree.plot(nk_path = "ClonalTree/Examples/output/ClonalFamily_200.abRT.nk")

## 1.4 Class switch recombination (CSR) network
# Visualization of isotype co-occurrence within clusters within a cluster
# Circle size represents the number of sequences carrying a given isotype.
# Lines connecting two circles indicate the enrichment level of observing switches in the two corresponding immunoglobulin subclasses.
# The enrichment level is the ratio of observed and expected switches if immunoglobulin isotypes are assumed to be independently distributed among cluster.
# Matrix of values of connected edges between clustered sequences in different isotypes is printed.
CSR.cluster.plot(bcr_clusters = COVID_01_clusters, index = 50)
      IGHM IGHD IGHG3 IGHG1 IGHA1 IGHG2 IGHG4 IGHE IGHA2
IGHM     0    0     0     1     1     0     0    0     0
IGHD     0    0     0     0     0     0     0    0     0
IGHG3    0    0     0     0     0     0     0    0     0
IGHG1    1    0     0     0     1     0     0    0     0
IGHA1    1    0     0     1     0     0     0    0     0
IGHG2    0    0     0     0     0     0     0    0     0
IGHG4    0    0     0     0     0     0     0    0     0
IGHE     0    0     0     0     0     0     0    0     0
IGHA2    0    0     0     0     0     0     0    0     0
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

## 2. Single sample/group analysis
## 2.1 Summary of clusters from a sample
# Summarize the number of clusters, the average size of clusters and the proportion of clustered sequences.
COVID_clusters_summary = Clusters.summary(pro_data_list = COVID_pro_data_list, clusters_list = COVID_clusters_list)
HC_clusters_summary = Clusters.summary(pro_data_list = HC_pro_data_list, clusters_list = HC_clusters_list)
COVID_01_summary = COVID_clusters_summary$COVID_01
print(COVID_01_summary)
$`number of clusters`
[1] 4903

$`average size of clusters`
[1] 8.95

$`number of clustered seqs`
[1] 43862

$`number of all seqs`
[1] 88502

$`proportion of clustered sequences`
[1] "49.56%"
HC_01_summary = HC_clusters_summary$HC_01
print(HC_01_summary)
$`number of clusters`
[1] 798

$`average size of clusters`
[1] 6.74

$`number of clustered seqs`
[1] 5380

$`number of all seqs`
[1] 105364

$`proportion of clustered sequences`
[1] "5.11%"
## 2.2 Visualization of clusters from a sample
# Point diagram showing clusters from a sample where a circle represents a cluster.
# The size and color of the circle represents the size of the cluster.
# The parameter "index" represents the index of the sample for visualization.
Clusters.visualization(pro_data_list = COVID_pro_data_list, clusters_list = COVID_clusters_list, index = 1)

Clusters.visualization(pro_data_list = HC_pro_data_list, clusters_list = HC_clusters_list, index = 1)

## 2.3 V/J gene usage
## 2.3.1 Pie chart: V/J gene
# Pie chart showing the gene usage frequency of clustered sequences.
# The top ten most frequent genes are shown, and the rest are represented by "others".
# The parameter "colname" can be "v_call" for V gene or "j_call" for J gene.
# (1) single sample
COVID_01_clustered_seqs = COVID_seqs_list$clustered_seqs$COVID_01
COVID_01_unclustered_seqs = COVID_seqs_list$unclustered_seqs$COVID_01
HC_01_clustered_seqs = HC_seqs_list$clustered_seqs$HC_01
pie.freq.plot(clustered_seqs = COVID_01_clustered_seqs, colname = "v_call")

pie.freq.plot(clustered_seqs = HC_01_clustered_seqs, colname = "v_call")

# (2) multiple samples in a group.
# All the clustered/unclustered sequences from multiple samples in a group can be merged.
COVID_all_clustered_seqs = NULL
for(i in 1:length(COVID_seqs_list$clustered_seqs)){
  COVID_all_clustered_seqs = rbind(COVID_all_clustered_seqs, COVID_seqs_list$clustered_seqs[[i]])
}
COVID_all_unclustered_seqs = NULL
for(i in 1:length(COVID_seqs_list$unclustered_seqs)){
  COVID_all_unclustered_seqs = rbind(COVID_all_unclustered_seqs, COVID_seqs_list$unclustered_seqs[[i]])
}
HC_all_clustered_seqs = NULL
for(i in 1:length(HC_seqs_list$clustered_seqs)){
  HC_all_clustered_seqs = rbind(HC_all_clustered_seqs, HC_seqs_list$clustered_seqs[[i]])
}
HC_all_unclustered_seqs = NULL
for(i in 1:length(HC_seqs_list$unclustered_seqs)){
  HC_all_unclustered_seqs = rbind(HC_all_unclustered_seqs, HC_seqs_list$unclustered_seqs[[i]])
}
pie.freq.plot(clustered_seqs = COVID_all_clustered_seqs, colname = "v_call")

pie.freq.plot(clustered_seqs = HC_all_clustered_seqs, colname = "v_call")

## 2.3.2 Heatmap: V-J gene pair
# Heatmap showing the V-J gene pair frequency of clustered sequences
# (1) single sample.
vjpair.sample.plot(clustered_seqs = COVID_01_clustered_seqs)
Using Freq as value column: use value.var to override.

# (2) multiple samples in a group.
vjpair.sample.plot(clustered_seqs = COVID_all_clustered_seqs)
Using Freq as value column: use value.var to override.

vjpair.sample.plot(clustered_seqs = HC_all_clustered_seqs)
Using Freq as value column: use value.var to override.

## 2.4 Junction length distribution
# Histogram and density plot showing the length distribution of junction amino acid of clustered sequences.
# (1) single sample
len.sample.plot(clustered_seqs = COVID_01_clustered_seqs)

# (2) multiple samples in a group
len.sample.plot(clustered_seqs = COVID_all_clustered_seqs)

# Density ridges showing the length distributions of junction amino acid of clustered sequences and unclustered sequences.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
# (1) single sample
len.clustered.plot(clustered_seqs = COVID_01_clustered_seqs,
                   unclustered_seqs = COVID_01_unclustered_seqs)
[1] "p-value: <2e-16"
Warning: Ignoring unknown parameters: `size`

# (2) multiple samples in a group
len.clustered.plot(clustered_seqs = COVID_all_clustered_seqs,
                   unclustered_seqs = COVID_all_unclustered_seqs)
[1] "p-value: <2e-16"
Warning: Ignoring unknown parameters: `size`

## 2.5 Class switch recombination (CSR) network
# Visualization of isotype co-occurrence within clusters from a sample.
# Circle size represents the number of sequences carrying a given isotype.
# Lines connecting two circles indicate the enrichment level of observing switches in the two corresponding immunoglobulin subclasses.
# The enrichment level is the ratio of observed and expected switches if immunoglobulin isotypes are assumed to be independently distributed among cluster.
# Matrix of values of connected edges between clustered sequences in different isotypes is printed.
CSR.sample.plot(bcr_clusters = COVID_01_clusters)
      IGHM IGHD IGHG3 IGHG1 IGHA1 IGHG2 IGHG4 IGHE IGHA2
IGHM     0   91    39   465   256    79     1    0    51
IGHD    91    0     4    45    28     8     0    0     7
IGHG3   39    4     0    96    41    32     0    0     6
IGHG1  465   45    96     0   332   165     1    1    50
IGHA1  256   28    41   332     0    80     1    0   108
IGHG2   79    8    32   165    80     0     2    0    56
IGHG4    1    0     0     1     1     2     0    0     0
IGHE     0    0     0     1     0     0     0    0     0
IGHA2   51    7     6    50   108    56     0    0     0

CSR.sample.plot(bcr_clusters = HC_01_clusters)
      IGHM IGHD IGHG3 IGHG1 IGHA1 IGHG2 IGHG4 IGHE IGHA2
IGHM     0  240    11    25    58    51     1    0    36
IGHD   240    0     5    32    42    32     1    0    25
IGHG3   11    5     0    12    16    19     1    0     8
IGHG1   25   32    12     0    72    58     1    0    34
IGHA1   58   42    16    72     0    97     2    0    75
IGHG2   51   32    19    58    97     0     4    0    72
IGHG4    1    1     1     1     2     4     0    0     2
IGHE     0    0     0     0     0     0     0    0     0
IGHA2   36   25     8    34    75    72     2    0     0

### 2.6 Neutralizing antibody (NAb) sequence query
## 2.6.1 Load the public antibody database to get the information of neutralizing antibody (NAb) sequences.
# Here is an example of the Coronavirus Antibody Database (CoV-AbDab).
CoV_AbDab = read.csv("example/CoV-AbDab_130623.csv")
## 2.6.2 NAb query
# Query the corresponding sequence from the public antibody database.
# The parameter "method" represents the CDRH3 matching method.
# It can be "NA" for perfect match, "hamming" for hamming distance or "lv" for Levenshtein distance. Defaults to "NA".
# The parameter "species" can be "Mouse" or "Human". Defaults to "Human".
# The parameter "maxDist" represents the maximum distance allowed for matching when the argument "method" is "hamming" or "lv". Defaults to "NA".
# example to perfect match
human_perfect_match <- NAb.query(bcr_clusters = COVID_01_clusters, AbDab = CoV_AbDab, method = NA, maxDist = NA, species = "Human")
[1] "The query result has 26 rows."
head(human_perfect_match)
# example with perfect match in "Mouse" species
mouse_perfect_match <- NAb.query(bcr_clusters = COVID_01_clusters, AbDab = CoV_AbDab, method = NA, maxDist = NA, species = "Mouse")
[1] "The query result is empty."
# example with fuzzy matching with "hamming" method and max distance of 1
human_hamming_1_match <- NAb.query(bcr_clusters = COVID_01_clusters, AbDab = CoV_AbDab, method = "hamming", maxDist = 1, species = "Human")
[1] "The query result has 37 rows."
## 3. Inter group analysis
## 3.1 V/J gene usage
## 3.1.1 Boxplot: V/J gene
# Boxplot showing the V/J gene usage of the clustered sequences between two groups.
# The parameter "colname" can be "v_call" for V gene or "j_call" for J gene.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
gene.fre.plot(group1_seqs_list = COVID_seqs_list,
              group1_all_clustered_seqs = COVID_all_clustered_seqs,
              group1_label = "COVID-19",
              group2_seqs_list = HC_seqs_list,
              group2_all_clustered_seqs = HC_all_clustered_seqs,
              group2_label = "HC",
              colname = "v_call")

gene.fre.plot(group1_seqs_list = COVID_seqs_list,
              group1_all_clustered_seqs = COVID_all_clustered_seqs,
              group1_label = "COVID-19",
              group2_seqs_list = HC_seqs_list,
              group2_all_clustered_seqs = HC_all_clustered_seqs,
              group2_label = "HC",
              colname = "j_call")

## 3.1.2 Heatmap: V-J gene pair
# Heatmap showing the fold change of V-J gene pair frequency of clustered sequences between two groups.
# Log fold change (log FC) is calculated as the log2 ratio of the average values between group1 and group2 samples.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
# FDR correction was performed with the Benjamini–Hochberg procedure.
# The V-J gene pair frequency of samples with a frequency less than the minimum value (1‰) is set to the minimum value.
vjpair.group.plot(group1_seqs_list = COVID_seqs_list,
                  group1_all_clustered_seqs = COVID_all_clustered_seqs,
                  group1_label = "COVID-19",
                  group2_seqs_list = HC_seqs_list,
                  group2_all_clustered_seqs = HC_all_clustered_seqs,
                  group2_label = "HC")

## 3.2 Junction length
# Density ridges showing the length distribution of junction amino acid of clustered sequences between two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
len.group.plot(group1_all_clustered_seqs = COVID_all_clustered_seqs, group1_label = "COVID-19",
               group2_all_clustered_seqs = HC_all_clustered_seqs, group2_label = "HC")
[1] "p-value: <2e-16"
Warning: Ignoring unknown parameters: `size`

## 3.3 Diversity analysis
## 3.3.1 Number and size of clusters
# Bubble plot showing the size and number of clusters between two groups.
clu.size.plot(clusters_list1 = COVID_clusters_list, group1_label = "COVID-19",
              clusters_list2 = HC_clusters_list, group2_label = "HC")
[1] "p-value: 1.1e-15"

## 3.3.2 Tcf20 score
# Tcf20 score represents the proportion of sequences attributed to the top 20 clonal families out of the total number of BCR sequences.
# Boxplot showing the Tcf20 scores between two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
Tcf20.plot(clusters_list1 = COVID_clusters_list, group1_label = "COVID-19",
           clusters_list2 = HC_clusters_list, group2_label = "HC")
[1] "p-value: 0.0079"

## 3.4 Somatic hypermutation (SHM) analysis
# Calculate the average SHM ratio of all clusters in each sample.
# The calculation of SHM ratios may take a while.
SHM_df = SHM.calculation(clusters_list1 = COVID_clusters_list, group1_label = "COVID-19",
                         clusters_list2 = HC_clusters_list, group2_label = "HC")
[1] "group1 processing:"

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%[1] "group2 processing:"

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%
# Boxplot showing the SHM ratios between two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
SHM.plot(SHM_df = SHM_df)
[1] "p-value: 0.0079"

# Calculate the SHM ratios of clustered sequences in different isotypes in each sample.
# The calculation of SHM ratios may take a while.
SHM_iso_df = SHM.iso.calculation(clusters_list1 = COVID_clusters_list, group1_label = "COVID-19",
                                 clusters_list2 = HC_clusters_list, group2_label = "HC")
[1] "group1 processing:"

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%[1] "group2 processing:"

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%
# Boxplot showing the SHM ratios of clustered sequences in different isotypes between two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
SHM.iso.plot(SHM_iso_df = SHM_iso_df)
 [1] "COVID-19_IGHD~COVID-19_IGHM p-value: 0.0079" "COVID-19_IGHD~COVID-19_IGHA p-value: 0.0079" "COVID-19_IGHD~COVID-19_IGHG p-value: 0.0079"
 [4] "COVID-19_IGHD~HC_IGHA p-value: 0.0079"       "COVID-19_IGHD~HC_IGHG p-value: 0.0079"       "COVID-19_IGHM~HC_IGHD p-value: 0.0079"      
 [7] "COVID-19_IGHM~HC_IGHM p-value: 0.0079"       "COVID-19_IGHM~HC_IGHA p-value: 0.0079"       "COVID-19_IGHM~HC_IGHG p-value: 0.0079"      
[10] "COVID-19_IGHA~HC_IGHD p-value: 0.0079"       "COVID-19_IGHA~HC_IGHM p-value: 0.0079"       "COVID-19_IGHA~HC_IGHA p-value: 0.0159"      
[13] "COVID-19_IGHG~HC_IGHD p-value: 0.0079"       "COVID-19_IGHG~HC_IGHM p-value: 0.0079"       "COVID-19_IGHG~HC_IGHA p-value: 0.0159"      
[16] "COVID-19_IGHG~HC_IGHG p-value: 0.0317"       "HC_IGHD~HC_IGHM p-value: 0.0317"             "HC_IGHD~HC_IGHA p-value: 0.0079"            
[19] "HC_IGHD~HC_IGHG p-value: 0.0079"             "HC_IGHM~HC_IGHA p-value: 0.0079"             "HC_IGHM~HC_IGHG p-value: 0.0079"            

### 3.5 NAb ratio calculation
## 3.5.1 Load the public antibody database to get the information of neutralizing antibody (NAb) sequences.
# Here is an example of the Coronavirus Antibody Database (CoV-AbDab).
CoV_AbDab = read.csv("example/CoV-AbDab_130623.csv")
# Obtain the IGHV gene, IGHJ gene and CDRH3 of all antibody sequences
v = sapply(strsplit(CoV_AbDab$Heavy.V.Gene, " "), function(x) x[1])
j = sapply(strsplit(CoV_AbDab$Heavy.J.Gene, " "), function(x) x[1])
cdr3 = sapply(strsplit(CoV_AbDab$CDRH3, " "), function(x) x[1])
vjcdr3 = unique(paste(v, j, cdr3))
## 3.5.2 NAb ratio calculation
# NAb ratio is established as an indicator of the proportional prevalence of neutralizing antibody sequences within expanded clonotypes in each sample.
# It is defined as the fraction of the number of NAb sequences within clonal families to the total number of NAb sequences present in each sample.
NAb_ratio_df = NAb.ratio.calculation(pro_data_list1 = COVID_pro_data_list, clusters_list1 = COVID_clusters_list, group1_label = "COVID-19",
                                     pro_data_list2 = HC_pro_data_list, clusters_list2 = HC_clusters_list, group2_label = "HC", NAb_vjcdr3 = vjcdr3)
[1] "group1 processing:"

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%[1] "group2 processing:"

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%
# Boxplot showing the NAb ratios between the two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
NAb.ratio.plot(NAb_ratio_df)
[1] "p-value: 0.025"

---
title: "fastBCR pipeline notebook"
output: html_notebook
---

```{r setup, include=TRUE}
# You should set "fastBCR" folder as working directory before running pipeline.
# knitr::opts_knit$set(root.dir = "THE/PATH/TO/FASTBCR/FOLDER")
knitr::opts_knit$set(root.dir = "~/Documents/Rpackage/fastBCR-1.1") # Replace with your path.
```
```{r}
library(fastBCR)
```

```{r}
### Fast BCR clonal family inference
## 1. Data loading
# Path to the "COVID"/"HC" folder in the "example" folder in fastBCR folder.
COVID_folder_path <- "example/COVID"
HC_folder_path <- "example/HC"
```
```{r}
# Load files from "COVID_folder_path"/"HC_folder_path" into list.
# The storage format of data can be in "csv", "tsv", or "Rdata" format.
# The compressed files in the above storage format in "7zip", "cab", "cpio", "iso9660", "lha", "mtree", "shar", "rar", "raw", "tar", "xar", "zip", "warc" format can also be read in.
COVID_raw_data_list <- data.load(folder_path = COVID_folder_path, storage_format = "csv")
HC_raw_data_list <- data.load(folder_path = HC_folder_path, storage_format = "csv")
```

```{r}
## 2. Data preprocessing
# Preprocessing of raw data to meet the input requirements for clonal family inference.
# The input of the function needs to meet the AIRR standard format (containing at least "sequence_id", "v_call", "j_call", and "junction_aa" information).
# "junction" and "c_call" are optional if you want to plot the evolutionary tree or need isotype related (SHM/CSR) analysis.
# Only productive sequences whose junction amino acid lengths between 9 and 26 are reserved.
# The "count_col_name" parameter represents the column name for the count of each sequence which can be "consensus_count", "duplicate_count" or "umi_count" according to your needs.
# It defaults to "NA" which means the original count of the sequence is not taken into account.
# Sequences with the same "v_call", "j_call" and "junction_aa" are considered to be the same clonotype and are merged into one row in processed data.
# The column "clonotype_count" is the number of sequences belonging to each clonotype.
# The column "clone_count" is the sum of the counts (calculated based on "count_col_name" parameter) of the sequences belonging to each clonotype.
# The column "clone_fre" is the frequency version of "clone_count".
# The information of the sequence with the highest count in each clonotype is retained.
# The list "index_match" saves the original indexes of sequences for each clonotype.
COVID_pro_data_list <- data.preprocess(data_list = COVID_raw_data_list, count_col_name = "consensus_count")
HC_pro_data_list <- data.preprocess(data_list = HC_raw_data_list, count_col_name = "consensus_count")
```

```{r}
## 3. BCR clonal family inference
# Fast clonal family inference from preprocessed data.
# The "cluster_thre" parameter represents minimal clustering criteria (minimum number of sequences needed to form a cluster) and defaults to 3.
# For high efficiency, the "cluster_thre" is increased by 1 for every 100,000 entries of input data.
# The "overlap_thre" parameter represents overlap coefficient threshold for merging two clusters, selectable range (0,1) and defaults to 0.1.
# Lower "overlap_thre" may lead to overclustering while higher thresholds may lead to the split of clonal families.
# The "consensus_thre" parameter represents the consensus score threshold for filtering candidates and defaults to 0.8.
# A higher "consensus_thre" means stricter inference of the cluster.
COVID_clusters_list <- data.BCR.clusters(pro_data_list = COVID_pro_data_list, cluster_thre = 3, overlap_thre = 0.1, consensus_thre = 0.8)
HC_clusters_list <- data.BCR.clusters(pro_data_list = HC_pro_data_list, cluster_thre = 3, overlap_thre = 0.1, consensus_thre = 0.8)
```

```{r}
## 4. Classification of clustered and unclustered sequences
# Merge all the clustered sequences in each sample into "clustered_seqs".
# Merge all the unclustered sequences in each sample into "unclustered_seqs".
COVID_seqs_list <- Clustered.seqs(pro_data_list = COVID_pro_data_list, clusters_list = COVID_clusters_list)
HC_seqs_list <- Clustered.seqs(pro_data_list = HC_pro_data_list, clusters_list = HC_clusters_list)
```


```{r}
### Downstream analysis
## 1. Single cluster analysis
## 1.1 Conserved motifs distribution
# Visualization of multiple sequence alignment (MSA) of junction sequences within a cluster.
# The parameter "index" allows you to choose a cluster for visualization.
# The parameter "type" can be "DNA" for deoxyribonucleic acid or "AA" for amino acid.
COVID_01_clusters = COVID_clusters_list$COVID_01
HC_01_clusters = HC_clusters_list$HC_01
msa.plot(bcr_clusters = COVID_01_clusters, index = 200, type = "DNA")
msa.plot(bcr_clusters = COVID_01_clusters, index = 200, type = "AA")
```
```{r}
## 1.2 Sequence logo
# Visualization of sequence logo of junction sequences within a cluster.
seqlogo.plot(bcr_clusters = COVID_01_clusters, index = 200, type = "DNA")
seqlogo.plot(bcr_clusters = COVID_01_clusters, index = 200, type = "AA")
```

```{r}
## 1.3 Evolutionary tree
# Reconstruct a B cell lineage tree with minimum spanning tree and genotype abundances using ClonalTree.
# The junction of BCR sequences within a cluster will be written as "ClonalFamily_index.fasta" in "ClonalTree/Examples/input" folder.
# ClonalTree uses Python for compilation, so it needs the absolute path of the Python interpreter in the "python_path" parameter.
# ClonalTree returns two files in the "ClonalTree/Examples/output" folder.
# ClonalFamily_index.nk: the reconstructed BCR lineage tree in newick format.
# ClonalFamily_index.nk.csv: a table in csv format, containing the parent relationship and cost.
clonal.tree.generation(bcr_clusters = COVID_01_clusters, index = 200, python_path = "/Users/wangkaixuan/anaconda3/bin/python") # Replace with your path
# You can use the clonal.tree.plot() function to visualize the evolutionary tree in R.
# The consensus sequence of the cluster is used as the root node of the tree.
# Orange points represent nodes and blue points represent tips.
# Abbreviation of "sequence_id" (last 5 characters) are shown.
# The x-axis shows the absolute genetic distance.
clonal.tree.plot(nk_path = "ClonalTree/Examples/output/ClonalFamily_200.abRT.nk")
```


```{r}
## 1.4 Class switch recombination (CSR) network
# Visualization of isotype co-occurrence within clusters within a cluster
# Circle size represents the number of sequences carrying a given isotype.
# Lines connecting two circles indicate the enrichment level of observing switches in the two corresponding immunoglobulin subclasses.
# The enrichment level is the ratio of observed and expected switches if immunoglobulin isotypes are assumed to be independently distributed among cluster.
# Matrix of values of connected edges between clustered sequences in different isotypes is printed.
CSR.cluster.plot(bcr_clusters = COVID_01_clusters, index = 50)
```
```{r}
## 2. Single sample/group analysis
## 2.1 Summary of clusters from a sample
# Summarize the number of clusters, the average size of clusters and the proportion of clustered sequences.
COVID_clusters_summary = Clusters.summary(pro_data_list = COVID_pro_data_list, clusters_list = COVID_clusters_list)
HC_clusters_summary = Clusters.summary(pro_data_list = HC_pro_data_list, clusters_list = HC_clusters_list)
COVID_01_summary = COVID_clusters_summary$COVID_01
print(COVID_01_summary)
HC_01_summary = HC_clusters_summary$HC_01
print(HC_01_summary)
```
```{r}
## 2.2 Visualization of clusters from a sample
# Point diagram showing clusters from a sample where a circle represents a cluster.
# The size and color of the circle represents the size of the cluster.
# The parameter "index" represents the index of the sample for visualization.
Clusters.visualization(pro_data_list = COVID_pro_data_list, clusters_list = COVID_clusters_list, index = 1)
Clusters.visualization(pro_data_list = HC_pro_data_list, clusters_list = HC_clusters_list, index = 1)
```

```{r}
## 2.3 V/J gene usage
## 2.3.1 Pie chart: V/J gene
# Pie chart showing the gene usage frequency of clustered sequences.
# The top ten most frequent genes are shown, and the rest are represented by "others".
# The parameter "colname" can be "v_call" for V gene or "j_call" for J gene.
# (1) single sample
COVID_01_clustered_seqs = COVID_seqs_list$clustered_seqs$COVID_01
COVID_01_unclustered_seqs = COVID_seqs_list$unclustered_seqs$COVID_01
HC_01_clustered_seqs = HC_seqs_list$clustered_seqs$HC_01
pie.freq.plot(clustered_seqs = COVID_01_clustered_seqs, colname = "v_call")
pie.freq.plot(clustered_seqs = HC_01_clustered_seqs, colname = "v_call")
```
```{r}
# (2) multiple samples in a group.
# All the clustered/unclustered sequences from multiple samples in a group can be merged.
COVID_all_clustered_seqs = NULL
for(i in 1:length(COVID_seqs_list$clustered_seqs)){
  COVID_all_clustered_seqs = rbind(COVID_all_clustered_seqs, COVID_seqs_list$clustered_seqs[[i]])
}
COVID_all_unclustered_seqs = NULL
for(i in 1:length(COVID_seqs_list$unclustered_seqs)){
  COVID_all_unclustered_seqs = rbind(COVID_all_unclustered_seqs, COVID_seqs_list$unclustered_seqs[[i]])
}
HC_all_clustered_seqs = NULL
for(i in 1:length(HC_seqs_list$clustered_seqs)){
  HC_all_clustered_seqs = rbind(HC_all_clustered_seqs, HC_seqs_list$clustered_seqs[[i]])
}
HC_all_unclustered_seqs = NULL
for(i in 1:length(HC_seqs_list$unclustered_seqs)){
  HC_all_unclustered_seqs = rbind(HC_all_unclustered_seqs, HC_seqs_list$unclustered_seqs[[i]])
}
pie.freq.plot(clustered_seqs = COVID_all_clustered_seqs, colname = "v_call")
pie.freq.plot(clustered_seqs = HC_all_clustered_seqs, colname = "v_call")
```
```{r}
## 2.3.2 Heatmap: V-J gene pair
# Heatmap showing the V-J gene pair frequency of clustered sequences
# (1) single sample.
vjpair.sample.plot(clustered_seqs = COVID_01_clustered_seqs)
```
```{r}
# (2) multiple samples in a group.
vjpair.sample.plot(clustered_seqs = COVID_all_clustered_seqs)
vjpair.sample.plot(clustered_seqs = HC_all_clustered_seqs)
```
```{r}
## 2.4 Junction length distribution
# Histogram and density plot showing the length distribution of junction amino acid of clustered sequences.
# (1) single sample
len.sample.plot(clustered_seqs = COVID_01_clustered_seqs)
# (2) multiple samples in a group
len.sample.plot(clustered_seqs = COVID_all_clustered_seqs)
# Density ridges showing the length distributions of junction amino acid of clustered sequences and unclustered sequences.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
# (1) single sample
len.clustered.plot(clustered_seqs = COVID_01_clustered_seqs,
                   unclustered_seqs = COVID_01_unclustered_seqs)
# (2) multiple samples in a group
len.clustered.plot(clustered_seqs = COVID_all_clustered_seqs,
                   unclustered_seqs = COVID_all_unclustered_seqs)
```
```{r}
## 2.5 Class switch recombination (CSR) network
# Visualization of isotype co-occurrence within clusters from a sample.
# Circle size represents the number of sequences carrying a given isotype.
# Lines connecting two circles indicate the enrichment level of observing switches in the two corresponding immunoglobulin subclasses.
# The enrichment level is the ratio of observed and expected switches if immunoglobulin isotypes are assumed to be independently distributed among cluster.
# Matrix of values of connected edges between clustered sequences in different isotypes is printed.
CSR.sample.plot(bcr_clusters = COVID_01_clusters)
CSR.sample.plot(bcr_clusters = HC_01_clusters)
```

```{r}
### 2.6 Neutralizing antibody (NAb) sequence query
## 2.6.1 Load the public antibody database to get the information of neutralizing antibody (NAb) sequences.
# Here is an example of the Coronavirus Antibody Database (CoV-AbDab).
CoV_AbDab = read.csv("example/CoV-AbDab_130623.csv")
## 2.6.2 NAb query
# Query the corresponding sequence from the public antibody database.
# The parameter "method" represents the CDRH3 matching method.
# It can be "NA" for perfect match, "hamming" for hamming distance or "lv" for Levenshtein distance. Defaults to "NA".
# The parameter "species" can be "Mouse" or "Human". Defaults to "Human".
# The parameter "maxDist" represents the maximum distance allowed for matching when the argument "method" is "hamming" or "lv". Defaults to "NA".
# example to perfect match
human_perfect_match <- NAb.query(bcr_clusters = COVID_01_clusters, AbDab = CoV_AbDab, method = NA, maxDist = NA, species = "Human")
head(human_perfect_match)
```

```{r}
# example with perfect match in "Mouse" species
mouse_perfect_match <- NAb.query(bcr_clusters = COVID_01_clusters, AbDab = CoV_AbDab, method = NA, maxDist = NA, species = "Mouse")
```
```{r}
# example with fuzzy matching with "hamming" method and max distance of 1
human_hamming_1_match <- NAb.query(bcr_clusters = COVID_01_clusters, AbDab = CoV_AbDab, method = "hamming", maxDist = 1, species = "Human")
```

```{r}
## 3. Inter group analysis
## 3.1 V/J gene usage
## 3.1.1 Boxplot: V/J gene
# Boxplot showing the V/J gene usage of the clustered sequences between two groups.
# The parameter "colname" can be "v_call" for V gene or "j_call" for J gene.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
gene.fre.plot(group1_seqs_list = COVID_seqs_list,
              group1_all_clustered_seqs = COVID_all_clustered_seqs,
              group1_label = "COVID-19",
              group2_seqs_list = HC_seqs_list,
              group2_all_clustered_seqs = HC_all_clustered_seqs,
              group2_label = "HC",
              colname = "v_call")
gene.fre.plot(group1_seqs_list = COVID_seqs_list,
              group1_all_clustered_seqs = COVID_all_clustered_seqs,
              group1_label = "COVID-19",
              group2_seqs_list = HC_seqs_list,
              group2_all_clustered_seqs = HC_all_clustered_seqs,
              group2_label = "HC",
              colname = "j_call")
```


```{r}
## 3.1.2 Heatmap: V-J gene pair
# Heatmap showing the fold change of V-J gene pair frequency of clustered sequences between two groups.
# Log fold change (log FC) is calculated as the log2 ratio of the average values between group1 and group2 samples.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
# FDR correction was performed with the Benjamini–Hochberg procedure.
# The V-J gene pair frequency of samples with a frequency less than the minimum value (1‰) is set to the minimum value.
vjpair.group.plot(group1_seqs_list = COVID_seqs_list,
                  group1_all_clustered_seqs = COVID_all_clustered_seqs,
                  group1_label = "COVID-19",
                  group2_seqs_list = HC_seqs_list,
                  group2_all_clustered_seqs = HC_all_clustered_seqs,
                  group2_label = "HC")
```

```{r}
## 3.2 Junction length
# Density ridges showing the length distribution of junction amino acid of clustered sequences between two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
len.group.plot(group1_all_clustered_seqs = COVID_all_clustered_seqs, group1_label = "COVID-19",
               group2_all_clustered_seqs = HC_all_clustered_seqs, group2_label = "HC")
```
```{r}
## 3.3 Diversity analysis
## 3.3.1 Number and size of clusters
# Bubble plot showing the size and number of clusters between two groups.
clu.size.plot(clusters_list1 = COVID_clusters_list, group1_label = "COVID-19",
              clusters_list2 = HC_clusters_list, group2_label = "HC")
```
```{r}
## 3.3.2 Tcf20 score
# Tcf20 score represents the proportion of sequences attributed to the top 20 clonal families out of the total number of BCR sequences.
# Boxplot showing the Tcf20 scores between two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
Tcf20.plot(clusters_list1 = COVID_clusters_list, group1_label = "COVID-19",
           clusters_list2 = HC_clusters_list, group2_label = "HC")
```
```{r}
## 3.4 Somatic hypermutation (SHM) analysis
# Calculate the average SHM ratio of all clusters in each sample.
# The calculation of SHM ratios may take a while.
SHM_df = SHM.calculation(clusters_list1 = COVID_clusters_list, group1_label = "COVID-19",
                         clusters_list2 = HC_clusters_list, group2_label = "HC")
```

```{r}
# Boxplot showing the SHM ratios between two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
SHM.plot(SHM_df = SHM_df)
```


```{r}
# Calculate the SHM ratios of clustered sequences in different isotypes in each sample.
# The calculation of SHM ratios may take a while.
SHM_iso_df = SHM.iso.calculation(clusters_list1 = COVID_clusters_list, group1_label = "COVID-19",
                                 clusters_list2 = HC_clusters_list, group2_label = "HC")
```

```{r}
# Boxplot showing the SHM ratios of clustered sequences in different isotypes between two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
SHM.iso.plot(SHM_iso_df = SHM_iso_df)
```


```{r}
### 3.5 NAb ratio calculation
## 3.5.1 Load the public antibody database to get the information of neutralizing antibody (NAb) sequences.
# Here is an example of the Coronavirus Antibody Database (CoV-AbDab).
CoV_AbDab = read.csv("example/CoV-AbDab_130623.csv")
# Obtain the IGHV gene, IGHJ gene and CDRH3 of all antibody sequences
v = sapply(strsplit(CoV_AbDab$Heavy.V.Gene, " "), function(x) x[1])
j = sapply(strsplit(CoV_AbDab$Heavy.J.Gene, " "), function(x) x[1])
cdr3 = sapply(strsplit(CoV_AbDab$CDRH3, " "), function(x) x[1])
vjcdr3 = unique(paste(v, j, cdr3))
```

```{r}
## 3.5.2 NAb ratio calculation
# NAb ratio is established as an indicator of the proportional prevalence of neutralizing antibody sequences within expanded clonotypes in each sample.
# It is defined as the fraction of the number of NAb sequences within clonal families to the total number of NAb sequences present in each sample.
NAb_ratio_df = NAb.ratio.calculation(pro_data_list1 = COVID_pro_data_list, clusters_list1 = COVID_clusters_list, group1_label = "COVID-19",
                                     pro_data_list2 = HC_pro_data_list, clusters_list2 = HC_clusters_list, group2_label = "HC", NAb_vjcdr3 = vjcdr3)
```

```{r}
# Boxplot showing the NAb ratios between the two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
NAb.ratio.plot(NAb_ratio_df)
```

